Motivation—Tim

This year, a report from Centers for Disease Control and Prevention revealed that America’s obesity rate has reached a record high. Contrasting to popular belief, New Yorkers are not so safe from the obesity epidemic, as more than half of adult New Yorkers are either overweight or obese. Studies show that the rise in the obesity epidemic is partly due to disparities in food environment; it is harder for some to eat healthier because their options are limited. This project intends to look deeper into the relationship between food environment in NYC and obesity rate along with diabetes rate and stroke hospitalization rate.

Initial questions—Tim

1. Can the percentage of certain fast-food restaurant be related to obesity, diabetes and stroke rate in different neighborhoods of NYC?

2. Can the percentage of health inspection grade A restaurant be related to obesity, diabetes and stroke rate in different neighborhoods of NYC?

3. Can the composite restaurant health score of each neighborhood be related to chronic disease rate?

For the first two questions, we stick to it. For the third question, we change the main predictor to percentage of fast food cuisine type since it was easy to acquire in the data. Composite restaurant health score were to subjective and difficult to assess. Moreover, we calculated the boro level model first due to the difficulties we encountered in scaping zipcode-neighborhood data. However, after some struggle, we still managed to get the neighborhood information and analyze the data accordingly.

Data and Methods

Data Source and Collection—Tim

1. NYC restaurant inspection:

https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j

2. 2015 Community Health Profiles Open Data:

https://www1.nyc.gov/site/doh/data/data-publications/profiles.page

These data were rather easy to acquire, since they can be downloaded by CSV format. Because health and demographic data are grouped by neighborhood but the restaurant inspection data was using zipcodes. We had to match the zipcode and neighborhood name in order to merge the datasets which was the hardest.

First, we scraped the table data from https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm. However the neighborhood name and combinations were different from the health profile data we downloaded. We cannot merge the two datasets. Then we tried to look for the raw classification for the neigborhoods in health data from New York University’s Furman Center for Real Estate and Urban Policy and the NYC Department of City Planning. However, still no luck becuase the neighborhood was not divided by zipcode areas. Then we tried the hardest way: search the name of the neighborhood on Google and finding the matching zipcodes. It worked out well, but it was not precise, thus we left out some of the restaurants without a matching neighborhood name.

Data Import and Cleaning—Tim

Downloading restaurant inspection data from NYC open data:

get_all_inspections = function(url) {
  
  all_inspections = vector("list", length = 0)
  
  loop_index = 1
  chunk_size = 50000
  DO_NEXT = TRUE
  
  while (DO_NEXT) {
    message("Getting data, page ", loop_index)
    
    all_inspections[[loop_index]] = 
      GET(url,
          query = list(`$order` = "zipcode",
                       `$limit` = chunk_size,
                       `$offset` = as.integer((loop_index - 1) * chunk_size)
                       )
          ) %>%
      content("text") %>%
      fromJSON() %>%
      as_tibble()
    
    DO_NEXT = dim(all_inspections[[loop_index]])[1] == chunk_size
    loop_index = loop_index + 1
  }
  
  all_inspections
  
}

url = "https://data.cityofnewyork.us/resource/9w7m-hzhe.json"

rest_inspection = get_all_inspections(url) %>%
  bind_rows() 

Downloading health and demographic data CSV into local folder and reading, cleaning it:

download.file("https://www1.nyc.gov/assets/doh/downloads/excel/episrv/2015_CHP_PUD.xlsx", mode="wb", destfile = "health.xlsx")
health <- read_excel("health.xlsx", sheet = "CHP_all_data") %>% 
  select(Name, Racewhite_Rate, Age0to17_rate, 
         Age18to24_rate, Age25to44_rate, Poverty, Unemployment,
         Smoking, Exercise,
         Obesity, Diabetes, Stroke_Hosp) %>% 
  clean_names() %>% 
  mutate(age0to44_rate = age0to17_rate + age18to24_rate + age25to44_rate) %>% 
  select(-age0to17_rate, -age18to24_rate, -age25to44_rate)

Matching the neighborhood names with restaurant zipcodes:

zip_neighbor <- read_csv("neigh_zipcode.csv") %>% 
  mutate(zipcode = as.character(zipcode))
##restaurant data with neighbourhood
rest_neighborhood = left_join(rest_inspection, zip_neighbor, by = "zipcode") %>% 
  filter(!is.na(neighborhood))

Exploratory Analysis

Health—Tim&Austin

explore_boro = health[c(2:6),] %>% 
  mutate(name = as.factor(name)) %>% 
  data.frame()

  # for obesity.
  explore_boro %>%
  mutate(name = fct_reorder(name,obesity)) %>% 
  plot_ly(x = ~name, y = ~obesity, color = ~name, type = "bar")
  # for diabetes.
  explore_boro %>% 
  mutate(name = fct_reorder(name,diabetes)) %>%
  plot_ly(x = ~name, y = ~diabetes, color = ~name, type = "bar")
  # for stroke.
  explore_boro %>% 
  mutate(name = fct_reorder(name,stroke_hosp)) %>%
  plot_ly(x = ~name, y = ~stroke_hosp, color = ~name, type = "bar")

For obesity, Manhattan has lowerest percentage. Bronx has the highest. And there are increasing obesity percentages as the order of Queens, Brooklyn and Staten Island. For diabetes, the lowest and highest are the same as those of obesity. The difference is that the percentages of Queens and Staten Island are nearly the same and a little bit lower than that of Brooklyn. For stroke, it has the same situation with that of diabetes.

health_only_neighborhood <- health[-c(1:6),] %>% 
  rename(neighborhood = name) %>% 
  mutate(neighborhood = as.factor(neighborhood))
##Plotting for outcome in different neighborhood


health_only_neighborhood %>%
  mutate(neighborhood = fct_reorder(neighborhood, obesity)) %>% 
  filter(obesity > 25) %>% 
  ggplot(aes(x = neighborhood,y = obesity, fill = neighborhood)) + geom_col() +  
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        legend.position = "none") +
  labs(title = "Neighborhood with 25 percent more obesity rate")

health_only_neighborhood %>%
  mutate(neighborhood = fct_reorder(neighborhood, diabetes)) %>% 
  filter(diabetes > 10) %>% 
  ggplot(aes(x = neighborhood,y = diabetes, fill = neighborhood)) + geom_col() +   
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        legend.position = "none") +
  labs(title = "Neighborhood with 10 percent more diabetes rate")

health_only_neighborhood %>%
  mutate(neighborhood = fct_reorder(neighborhood, stroke_hosp)) %>% 
  filter(stroke_hosp > 300) %>% 
  ggplot(aes(x = neighborhood,y = stroke_hosp, fill = neighborhood)) + 
  geom_col() +    
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        legend.position = "none") +
  labs(title = "Neighborhood with 300 more stroke hospitalization in 100,000 adults")

##Plotting for some adjusted variables


health_only_neighborhood %>%
  mutate(neighborhood = fct_reorder(neighborhood, poverty)) %>% 
  filter(poverty > 20) %>% 
  ggplot(aes(x = neighborhood,y = poverty, fill = neighborhood)) + geom_col() +  
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none") +
  labs(title = "Neighborhood with 20 percent or more poverty")

health_only_neighborhood %>%
  mutate(neighborhood = fct_reorder(neighborhood, age0to44_rate)) %>% 
  filter(age0to44_rate > 60) %>% 
  ggplot(aes(x = neighborhood,y = age0to44_rate, fill = neighborhood)) + 
  geom_col() +    
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none") +
  labs(title = "Neighborhood with 60 percent or more people under age 44")

Summary: Williamsbridge and Baychester have the highest obesity rate, up to 32%. East New York and Starrett City have the highest diabetes rate, up to 17 percent. BUshwick has the highest stroke hospitalization in 100,000 adults, up to 300 adults. Morrisania and Crotona has the highest percent of poverty rate, up to 40%. Fiancial district has the highest percent of young peoeple, up to 67%.

Restaurant data

Fastfood restaurants by cuisine type—Ruiwei

# Identify fastfood restaurants by their cuisine types. We print out the cuisine type list (n=85) and let everyone circle the ones they think is fastfood and we use the union as our rule (use union because it's more conservative).

rest_inspection %>%
  mutate(dba = str_to_upper(dba)) %>%
  arrange(cuisine_description) %>%
  count(cuisine_description)
# calculating the total number of restaurants and the number of fastfood restaurants in each boro, as well as the percentage of fastfood restaurants.

name_boro = c("MANHATTAN", "BROOKLYN", "QUEENS", "BRONX", "STATEN ISLAND")

rest_fastfood = 
  rest_inspection %>%
  filter(cuisine_description %in% c("Bagels/Pretzels",
                                    "Bottled beverages, including water, sodas, juices, etc.",
                                    "Chicken",
                                    "Delicatessen",
                                    "Donuts",
                                    "Hamburgers",
                                    "Hotdogs",
                                    "Hotdogs/Pretzels",
                                    "Ice Cream, Gelato, Yogurt, Ices",
                                    "Nuts/Confectionary",
                                    "Pancakes/Waffles",
                                    "Pizza",
                                    "Soul Food",
                                    "Sandwiches",
                                    "Sandwiches/Salads/Mixed Buffet",
                                    "Soups & Sandwiches"))

percent_fastfood = function(name_boro){

  rest_boro =
    rest_inspection %>%
    filter(boro == name_boro) %>%
    distinct(camis)
  
  n_rest = nrow(rest_boro)

  rest_fastfood_distinct = 
    rest_fastfood %>%
    filter(boro == name_boro) %>%
    distinct(camis, cuisine_description)
  
  n_fastfood = nrow(rest_fastfood_distinct)
    
  percent_fastfood = n_fastfood/n_rest
  
  tibble(
    boro = name_boro,
    n_fastfood = n_fastfood,
    n_rest = n_rest,
    percent_fastfood = percent_fastfood
  )
}

fastfood_boro = 
  map(name_boro, percent_fastfood) %>%
  bind_rows()

# calculating the total number of restaurants and the number of fastfood restaurants in the neighborhood, as well as the percentage of fastfood restaurants.

neighborhood_list = 
  rest_neighborhood %>%
  distinct(neighborhood) %>%
  arrange(neighborhood)
  
rest_fastfood_neighborhood = 
  rest_neighborhood %>%
  filter(cuisine_description %in% c("Bagels/Pretzels",
                                    "Bottled beverages, including water, sodas, juices, etc.",
                                    "Chicken",
                                    "Delicatessen",
                                    "Donuts",
                                    "Hamburgers",
                                    "Hotdogs",
                                    "Hotdogs/Pretzels",
                                    "Ice Cream, Gelato, Yogurt, Ices",
                                    "Nuts/Confectionary",
                                    "Pancakes/Waffles",
                                    "Pizza",
                                    "Soul Food",
                                    "Sandwiches",
                                    "Sandwiches/Salads/Mixed Buffet",
                                    "Soups & Sandwiches"))

percent_fastfood_neighborhood = function(name_neighborhood){

  rest_each_neighborhood =
    rest_neighborhood %>%
    filter(neighborhood == name_neighborhood) %>%
    distinct(camis)
  
  n_rest_neighborhood = nrow(rest_each_neighborhood)

  rest_fastfood_distinct_neighborhood = 
    rest_fastfood_neighborhood %>%
    filter(neighborhood == name_neighborhood) %>%
    distinct(camis, cuisine_description)
  
  n_fastfood_neighborhood = nrow(rest_fastfood_distinct_neighborhood)
    
  percent_fastfood_neighborhood = n_fastfood_neighborhood/n_rest_neighborhood
  
  tibble(
    neighborhood = name_neighborhood,
    n_fastfood = n_fastfood_neighborhood,
    n_rest = n_rest_neighborhood,
    percent_fastfood = percent_fastfood_neighborhood
  )
}

fastfood_neighborhood = 
  map(neighborhood_list$neighborhood, percent_fastfood_neighborhood) %>%
  bind_rows() %>%
  mutate(neighborhood = str_to_upper(neighborhood))
fastfood_boro %>%
  mutate(boro = as.factor(boro),
         n_rest = as.numeric(n_rest),
         boro = fct_reorder(boro, percent_fastfood)) %>%
  plot_ly(., x = ~boro, y = ~n_rest, type = 'bar', name = 'total restaurants') %>%
  add_trace(y = ~n_fastfood, name = 'fastfood restaurants') %>%
  layout(yaxis = list(title = 'number of restaurants'),
         xaxis = list(title = 'borough (ordered by percentage of fastfood restaurants)'),
         barmode = 'stack')

From the plot, as we anticipated, among five boroughs, Manhanttan has the largest number of restaurants but smallest percentage of fastfood restaurants. followed by Brooklyn. While Staten Island has the smallest number of total restaurants and Bronx has the largest percentage of fastfood restaurants.

fastfood_neighborhood %>%
  mutate(neighborhood = as.factor(neighborhood),
         n_rest = as.numeric(n_rest),
         neighborhood = fct_reorder(neighborhood, percent_fastfood)) %>%
  plot_ly(., x = ~neighborhood, y = ~n_rest, type = 'bar', name = 'total restaurants') %>%
  add_trace(y = ~n_fastfood, name = 'fastfood restaurants') %>%
  layout(yaxis = list(title = 'number of restaurants'), 
         xaxis = list(title = 'neighborhood (ordered by percentage of fastfood restaurants)',
                      tickangle = 45),
         barmode = 'stack')

From the plot, we can see that, while the Greenwich Village and SOHO neighborhood has fairly large number of restaurants, it has the smallest percentage of fastfood restaurants. Williamsbridge and Baychester has the largest percentage of fastfood restaurants.

When large number of total restaurants is not equal to large percentage of fastfood restaurants in that neighborhood, we can conclude that the distribution of fastfood restaurants is not even across neighborhoods, which also implies the motivation of our study, we want to investigate if this uneven distribution of fastfood restaurants is associated with diffferent level of chronic disease outcomes within a neighborhood.

Restaurant Chains—Sara

## Matching list of chain restaurants in U.S. and restaurant inspection data

chains_html = read_html("https://en.wikipedia.org/wiki/List_of_restaurant_chains_in_the_United_States#Fast-casual")

chain_rest = chains_html %>%
  html_nodes("ul:nth-child(9) li , .jquery-tablesorter tr:nth-child(1) td:nth-child(1)") %>%
  html_text() %>%
  as.tibble() %>%
  mutate(dba = value, 
         dba = str_to_upper(dba)) %>%
  select(dba)
# read in the list of chain restaurants in us 
# made the names to uppercase and changed the var name to dba
head(chain_rest, 10)
## # A tibble: 10 x 1
##                       dba
##                     <chr>
##  1        A&W RESTAURANTS
##  2             APPLEBEE'S
##  3             BAJA FRESH
##  4          BOSTON MARKET
##  5     BUFFALO WILD WINGS
##  6                CHILI'S
##  7 CHIPOTLE MEXICAN GRILL
##  8           CICI'S PIZZA
##  9    COLD STONE CREAMERY
## 10     CORNER BAKERY CAFE

I have scraped list of 75 national chain restaurants from the wikipedia page (https://en.wikipedia.org/wiki/List_of_restaurant_chains_in_the_United_States#Fast-casual) now I will join this dataset with restaurant inspection data to choose only the chain restaurants in NYC.

# removing punctuations in chain_rest & restaurant inspections
chain_rest_str = 
  chain_rest  %>%
  mutate(dba = str_replace_all(dba, "[[:punct:]]", ""))

rest_inspect_str = rest_inspection %>% 
  mutate(dba = str_replace_all(dba, "[[:punct:]]", ""))

# Matching the two datasets(restaurant inspection data that has all punctuation removed from dba(restaurant name) and list of chain restaurants by dba)
nyc_chain = 
  right_join(rest_inspect_str, chain_rest_str) %>%
  filter(!is.na(camis)) %>%
  distinct(camis, dba, boro)
## Joining, by = "dba"
nyc_chain %>%
  group_by(dba) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  head(10)
## # A tibble: 10 x 2
##                       dba     n
##                     <chr> <int>
##  1          DUNKIN DONUTS   453
##  2                 SUBWAY   344
##  3              STARBUCKS   282
##  4              MCDONALDS   207
##  5 CHIPOTLE MEXICAN GRILL    66
##  6                 WENDYS    42
##  7              APPLEBEES    28
##  8           WHITE CASTLE    25
##  9          BOSTON MARKET    22
## 10                   IHOP    21

The combined dataframe, nyc_chain has nrow(nyc_chain) observations. Also, there were 33 different chain restaurants extracted. The chart is showing the 10 chain restaurants in descending order. Next step is calculating the percentage of chain restaurants in each boro.

# counting chains in boro

count_chain = nyc_chain %>%
  group_by(boro) %>%
  summarise(chain_n = n())

count_rest = rest_inspection %>%
  distinct(boro, camis) %>%
  group_by(boro) %>%
  summarise(res_n = n())

# calculating percentage of chains in each boro
percent_chain = left_join(count_chain, count_rest) %>%
  mutate(chain_percentage = chain_n/res_n)
## Joining, by = "boro"
# let's see this in a graph
percent_chain %>%
  mutate(boro = forcats::fct_reorder(boro, chain_percentage)) %>%
   ggplot(aes(boro, chain_percentage, fill = boro)) + geom_bar(stat="identity")

We can see that Brooklyn has the lowest percentage of chain restauratns with percentage less that 5%. Next was Manhattan with 6% and the highest was Staten Island with 10% of chain restaurants.

Inspection Grade—Austin

data =
  rest_inspection %>%
  group_by(boro, grade) %>%
  summarise(n = n()) %>%
  mutate(grade_percent = n / sum(n)) %>% 
  filter(grade == "A", boro!="Missing") %>% 
  ungroup(boro)
data %>% 
  mutate(boro = as.factor(boro),
         boro = fct_reorder(boro,grade_percent)) %>% 
  plot_ly(x = ~boro, y = ~grade_percent, color = ~boro, type = "bar")

From the hisotgram, we can see that there is little difference between five boroughs about the percentage of “grade-A” restaurants in each borough.

data1 =
  rest_neighborhood %>%
  group_by(neighborhood, grade) %>%
  summarise(n = n()) %>%
  mutate(grade_percent = n / sum(n)) %>% 
  filter(grade == "A") %>% 
  ungroup(boro)
data1 %>% 
  mutate(neighborhood = as.factor(neighborhood),
         neighborhood = fct_reorder(neighborhood,grade_percent)) %>% 
  plot_ly(x = ~neighborhood, y = ~grade_percent, color = ~neighborhood, type = "bar")
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

Different from the situation in borough, the difference of percentage of “grade-A” restaurants in each borough exists. “Throgs Neck and Co-op City” has the greatest grade-A restaurants percentage, around 44.50%. “Sunset Park”, however, has the least, around 30.62%. “Washington Heights”, where we live, takes the fourth from the end, 34.39%, which is obviously consistent with the feeling we have towards the restaurant condition of “Washington Heights”.

Formal Analysis and Findings

Borough level

Fastfood restaurants by cuisine type—Ruiwei

# combine the dataset with health data and run linear simple regression models

health_boro = 
  health %>%
  mutate(boro = str_to_upper(name)) %>%
  select(-name)

combined_boro = 
  left_join(fastfood_boro, health_boro, by = "boro")

# fit single linear regression model between the three health outcomes and percentage of fastfood restaurants

lm(obesity ~ percent_fastfood, combined_boro) %>%
  broom::tidy() %>%
  kable()
term estimate std.error statistic p.value
(Intercept) 4.299516 8.665596 0.4961593 0.6538565
percent_fastfood 109.794316 45.332608 2.4219722 0.0940027
lm(diabetes ~ percent_fastfood, combined_boro) %>%
  broom::tidy() %>%
  kable()
term estimate std.error statistic p.value
(Intercept) 2.993526 4.175494 0.7169273 0.5251950
percent_fastfood 39.666809 21.843393 1.8159637 0.1669923
lm(stroke_hosp ~ percent_fastfood, combined_boro) %>%
  broom::tidy() %>%
  kable()
term estimate std.error statistic p.value
(Intercept) 213.5587 79.28017 2.693722 0.0741796
percent_fastfood 568.9957 414.74085 1.371931 0.2636765

According to the results of single linear regressions, we can conclude that there is no significant relationships between the three chronic disease outcomes and the percentage of fastfood restuarants within boroughs. (Obesity: p-value=0.09400275; Diabetes: p-value=0.1669923; Hospitalized Stroke: p-value=0.26367653)

Restaurant Chains—Sara

health_join = health %>% mutate(boro = str_to_upper(name))
health_chain_join = left_join(percent_chain, health_join, by = "boro") 
# function for modeling health outcome and percentage of chain restaurants

lm_sum = function(health_outcome){
  
 graphs_chain_health = health_chain_join %>%
  mutate(boro = forcats::fct_reorder(boro, health_outcome)) %>%
  ggplot(aes(chain_percentage, health_outcome, color = boro)) + geom_point() 
 
 reg_chain = lm(health_outcome ~ chain_percentage, data = health_chain_join) %>%
   summary()

list(graphs_chain_health, reg_chain)
# function that takes a input of column in health_chain_join and gives output of
 
# 1. graphs_health will give a graph showing the relationship between health outcomes in boroughs and their percentage of chain restaurants

# 2.reg_chain gives an output of the linear regression result
}
# model obesity
lm_sum(health_chain_join$obesity)
## [[1]]

## 
## [[2]]
## 
## Call:
## lm(formula = health_outcome ~ chain_percentage, data = health_chain_join)
## 
## Residuals:
##       1       2       3       4       5 
##  3.5555  6.1694 -6.6653 -2.5681 -0.4915 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)        14.012      9.198   1.523    0.225
## chain_percentage  146.365    119.681   1.223    0.309
## 
## Residual standard error: 5.83 on 3 degrees of freedom
## Multiple R-squared:  0.3327, Adjusted R-squared:  0.1102 
## F-statistic: 1.496 on 1 and 3 DF,  p-value: 0.3087

The chain_percentage-health_outcome graph shows the relationship between obesity and borough’s percetage of chain restaurants. It shows a slightly positive linear trend except for Brooklyn, which has the lowest percentage of health restaurants but third highest in percentage of obesity rate.

In the linear regression analysis,the p-value for the slope coeffcient, is 0.309, indicating that 0.05 significance level, there is no linear association between percentage of obesity in boroughs and their percentage of chain restaurants.

# model diabetes
lm_sum(health_chain_join$diabetes)
## [[1]]

## 
## [[2]]
## 
## Call:
## lm(formula = health_outcome ~ chain_percentage, data = health_chain_join)
## 
## Residuals:
##        1        2        3        4        5 
##  2.92305  1.61610 -2.85354 -0.08466 -1.60095 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)         7.638      4.260   1.793    0.171
## chain_percentage   37.467     55.426   0.676    0.547
## 
## Residual standard error: 2.7 on 3 degrees of freedom
## Multiple R-squared:  0.1322, Adjusted R-squared:  -0.1571 
## F-statistic: 0.457 on 1 and 3 DF,  p-value: 0.5475

Percentage of diabetes and chain restaurants seems to have a lower association when visualizing the relationship. In the linear regression analysis,the p-value for the slope coeffcient, is 0.546, indicating that 0.05 significance level, there is no linear association between percentage of diabetes in boroughs and their percentage of chain restaurants.

# model stroke
lm_sum(health_chain_join$stroke_hosp)
## [[1]]

## 
## [[2]]
## 
## Call:
## lm(formula = health_outcome ~ chain_percentage, data = health_chain_join)
## 
## Residuals:
##      1      2      3      4      5 
##  47.61  35.60 -49.67 -11.26 -22.27 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        288.82      74.14   3.896    0.030 *
## chain_percentage   420.32     964.59   0.436    0.692  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.99 on 3 degrees of freedom
## Multiple R-squared:  0.05952,    Adjusted R-squared:  -0.254 
## F-statistic: 0.1899 on 1 and 3 DF,  p-value: 0.6925

Percentage of diabetes and chain restaurants seems to have a lower association when visualizing the relationship. In the linear regression analysis,the p-value for the slope coeffcient, is 0.546, indicating that 0.05 significance level, there is no linear association between percentage of diabetes in boroughs and their percentage of chain restaurants. However, this was expected as the size of the data is only 6. Further research is needed.

Inspection Grade—Austin

combined = 
  left_join(data, health_boro, by = "boro") %>% 
  filter(boro!="Missing")

First, combine the data which we substract from the data of each borough of our “health” data with the dataset which shows the percentage of “grade-A” restaurants in each borough. Then we acquire the final dataset we can do the linear regression between obesity, diabetes and stroke with the percentage of “grade-A” restaurants, which we treat as a represent of the healthy restaurants in that borough.

reg1 <- lm(obesity ~ grade_percent, data = combined )
summary(reg1)
## 
## Call:
## lm(formula = obesity ~ grade_percent, data = combined)
## 
## Residuals:
##      1      2      3      4      5 
##  6.134  2.324 -8.812 -3.826  4.180 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)     21.182    177.658   0.119    0.913
## grade_percent    9.227    453.050   0.020    0.985
## 
## Residual standard error: 7.136 on 3 degrees of freedom
## Multiple R-squared:  0.0001382,  Adjusted R-squared:  -0.3331 
## F-statistic: 0.0004148 on 1 and 3 DF,  p-value: 0.985
broom::tidy(reg1)
##            term  estimate std.error statistic   p.value
## 1   (Intercept) 21.182324  177.6582 0.1192307 0.9126286
## 2 grade_percent  9.227001  453.0501 0.0203664 0.9850299

p-value is way much higher than 0.05, fail to reject the null, meaning that there is no linear association between the obesity and the percentage of grade A restaurant in each borough.

reg2 <- lm(diabetes ~ grade_percent, data = combined )
summary(reg2)
## 
## Call:
## lm(formula = diabetes ~ grade_percent, data = combined)
## 
## Residuals:
##       1       2       3       4       5 
##  3.2686  1.2209 -3.4596 -0.5280 -0.5019 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -7.646     71.396  -0.107    0.921
## grade_percent   46.027    182.070   0.253    0.817
## 
## Residual standard error: 2.868 on 3 degrees of freedom
## Multiple R-squared:  0.02086,    Adjusted R-squared:  -0.3055 
## F-statistic: 0.06391 on 1 and 3 DF,  p-value: 0.8168
broom::tidy(reg2)
##            term  estimate std.error  statistic   p.value
## 1   (Intercept) -7.646226  71.39646 -0.1070953 0.9214736
## 2 grade_percent 46.027498 182.06966  0.2528016 0.8167539

p-value is way much higher than 0.05, fail to reject the null, meaning that there is no linear association between the diabetes and the percentage of grade A restaurant in each borough.

reg3 <- lm(stroke_hosp ~ grade_percent, data = combined )
summary(reg3)
## 
## Call:
## lm(formula = stroke_hosp ~ grade_percent, data = combined)
## 
## Residuals:
##       1       2       3       4       5 
##  56.999  20.830 -55.476 -14.105  -8.247 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)      417.8     1204.9   0.347    0.752
## grade_percent   -249.8     3072.6  -0.081    0.940
## 
## Residual standard error: 48.4 on 3 degrees of freedom
## Multiple R-squared:  0.002199,   Adjusted R-squared:  -0.3304 
## F-statistic: 0.006612 on 1 and 3 DF,  p-value: 0.9403
broom::tidy(reg3)
##            term  estimate std.error   statistic   p.value
## 1   (Intercept)  417.7594  1204.893  0.34671918 0.7516972
## 2 grade_percent -249.8486  3072.622 -0.08131444 0.9403130

p-value is way much higher than 0.05, fail to reject the null, meaning that there is no linear association between the stroke and the percentage of grade A restaurant in each borough.
According to the three linear regression analysis above, we can conclude that there is no linear relationship between obesity, diabetes and stroke with the “grade-A” restaurants in each borough.

Neighborhood level

Fastfood restaurants by cuisine type—Ruiwei

# combine the dataset with health data and run linear simple regression models

health_neighborhood = 
  health %>%
  mutate(neighborhood = str_to_upper(name)) %>%
  select(-name)

combined_neighborhood = 
  left_join(fastfood_neighborhood, health_neighborhood, by = "neighborhood")

# fit multiple linear regression model between the three health outcomes and percentage of fastfood restaurants after adjusting fro race, age, poverty, smoking status and exercise status

lm(obesity ~ percent_fastfood + racewhite_rate + age0to44_rate + poverty + smoking + exercise, combined_neighborhood) %>%
  broom::tidy() %>%
  kable()
term estimate std.error statistic p.value
(Intercept) 25.8509571 13.6572083 1.8928434 0.0639505
percent_fastfood 48.6653415 13.8293772 3.5189829 0.0009095
racewhite_rate -0.1076851 0.0405338 -2.6566759 0.0104552
age0to44_rate -0.0877154 0.1529590 -0.5734566 0.5688075
poverty 0.1154984 0.1170059 0.9871161 0.3281567
smoking 0.7577658 0.2478180 3.0577520 0.0035184
exercise -0.2087047 0.1562323 -1.3358614 0.1874119
lm(diabetes ~ percent_fastfood + racewhite_rate + age0to44_rate + poverty + smoking + exercise, combined_neighborhood) %>%
  broom::tidy() %>%
  kable()
term estimate std.error statistic p.value
(Intercept) 17.5562374 5.7069140 3.0763101 0.0033386
percent_fastfood 17.5359140 5.7788580 3.0344947 0.0037566
racewhite_rate -0.0779147 0.0169378 -4.6000539 0.0000274
age0to44_rate 0.0054401 0.0639167 0.0851118 0.9324993
poverty 0.0443665 0.0488930 0.9074204 0.3683707
smoking 0.1532679 0.1035553 1.4800593 0.1448910
exercise -0.1467238 0.0652845 -2.2474519 0.0288798
lm(stroke_hosp ~ percent_fastfood + racewhite_rate + age0to44_rate + poverty + smoking + exercise, combined_neighborhood) %>%
  broom::tidy() %>%
  kable()
term estimate std.error statistic p.value
(Intercept) -0.0277768 143.238809 -0.0001939 0.9998460
percent_fastfood 438.0215920 145.044542 3.0199109 0.0039135
racewhite_rate -1.5958654 0.425124 -3.7538819 0.0004404
age0to44_rate 0.2560030 1.604257 0.1595773 0.8738323
poverty 1.7937921 1.227175 1.4617250 0.1498338
smoking 6.4669229 2.599151 2.4880903 0.0160854
exercise 1.6667802 1.638587 1.0172057 0.3137651

According to the results of multiple linear regressions, we can conclude that after adjusting for race, age, poverty, smoking status and exercise status, there is significant relationships between the three chronic disease outcomes and the percentage of fastfood restuarants within boroughs. (Obesity: p-value=0.000909491; Diabetes: p-value=3.756556e-03; Hospitalized Stroke: p-value=0.0039134849)

Restaurant Chains—Sara

# removing punctuations in restaurant inspections
rest_neigh_str = rest_neighborhood %>% 
  mutate(dba = str_replace_all(dba, "[[:punct:]]", ""))

# Matching the two datasets(neighborhood restaurant inspection data that has all punctuation removed from dba(restaurant name) and list of chain restaurants by dba)
neighborhood_chain = 
  right_join(rest_neigh_str, chain_rest_str) %>%
  filter(!is.na(camis)) %>%
  distinct(camis, dba, neighborhood)
## Joining, by = "dba"
# counting chains in neighborhoods

neigh_count_chain = neighborhood_chain %>%
  group_by(neighborhood) %>%
  summarise(chain_n = n())

neigh_count_rest = rest_neighborhood %>%
  distinct(neighborhood, camis) %>%
  group_by(neighborhood) %>%
  summarise(res_n = n())

# calculating percentage of chains in each boro
percent_neighborhood_chain = left_join(neigh_count_chain, neigh_count_rest) %>%
  mutate(chain_percentage = chain_n/res_n)
## Joining, by = "neighborhood"
# joinining with health data
health_chain_neighborhood = left_join(percent_neighborhood_chain, health_only_neighborhood) 
## Joining, by = "neighborhood"
## Warning: Column `neighborhood` joining character vector and factor,
## coercing into character vector
# function for modeling health outcome and percentage of chain restaurants

lm_neighborhood = function(health_outcome){
  
 graphs_chain_health = health_chain_neighborhood %>%
  mutate(boro = forcats::fct_reorder(neighborhood, health_outcome)) %>%
  ggplot(aes(chain_percentage, health_outcome, color = neighborhood)) + geom_point() + theme(legend.position="none")
 
 reg_chain = lm(health_outcome ~ chain_percentage + racewhite_rate + age0to44_rate + poverty + smoking + exercise, data = health_chain_neighborhood) %>%
   summary()

list(graphs_chain_health, reg_chain)
# function that takes a input of column in health_chain_neighborhood and gives output of:
 
# 1. graphs_health will give a graph showing the relationship between health outcomes in neighborhoods and their percentage of chain restaurants

# 2.reg_chain gives an output of the linear regression result
}
lm_neighborhood(health_chain_neighborhood$obesity)
## [[1]]

## 
## [[2]]
## 
## Call:
## lm(formula = health_outcome ~ chain_percentage + racewhite_rate + 
##     age0to44_rate + poverty + smoking + exercise, data = health_chain_neighborhood)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7491  -2.9746  -0.8985   3.7030   8.7543 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      27.47676   14.43271   1.904  0.06248 . 
## chain_percentage 62.71403   23.40020   2.680  0.00984 **
## racewhite_rate   -0.13307    0.04047  -3.288  0.00181 **
## age0to44_rate    -0.08829    0.16063  -0.550  0.58493   
## poverty           0.17523    0.12586   1.392  0.16978   
## smoking           0.82925    0.25615   3.237  0.00210 **
## exercise         -0.19116    0.16350  -1.169  0.24767   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.926 on 52 degrees of freedom
## Multiple R-squared:  0.6606, Adjusted R-squared:  0.6215 
## F-statistic: 16.87 on 6 and 52 DF,  p-value: 1.076e-10
# I tried log transformation and it's even better!

#health_chain_neighborhood %>%
#  mutate(boro = forcats::fct_reorder(neighborhood, obesity)) %>%
#  ggplot(aes(chain_percentage, obesity^2, color = neighborhood)) + geom_point() + #theme(legend.position="none")
#
#lm(obesity^2 ~ chain_percentage + racewhite_rate + age0to44_rate + poverty + smoking + #exercise, data = health_chain_neighborhood) %>%
#   summary()

Fitting a regression line with predictors chain_percentage, racewhite_rate, age0to44_rate, poverty,smoking, and exercise, we can see that the percentage of chains is a significant predictor with p-value, 0.01057. Holding all the other predictors constant, 62.2% increase in percentage of obesity in NYC neighborhoods is associated with 1% increase in chain restaurants. From the graph of percentage of chain restaurants and percentage of obesity, we can see there is a positive trend.

lm_neighborhood(health_chain_neighborhood$diabetes)
## [[1]]

## 
## [[2]]
## 
## Call:
## lm(formula = health_outcome ~ chain_percentage + racewhite_rate + 
##     age0to44_rate + poverty + smoking + exercise, data = health_chain_neighborhood)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0827 -1.1388 -0.1943  1.3611  4.4671 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      18.208130   5.976560   3.047  0.00363 ** 
## chain_percentage 22.293746   9.689984   2.301  0.02545 *  
## racewhite_rate   -0.087230   0.016760  -5.205 3.35e-06 ***
## age0to44_rate     0.004829   0.066516   0.073  0.94240    
## poverty           0.065477   0.052119   1.256  0.21462    
## smoking           0.179708   0.106070   1.694  0.09620 .  
## exercise         -0.140599   0.067707  -2.077  0.04280 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.04 on 52 degrees of freedom
## Multiple R-squared:  0.7571, Adjusted R-squared:  0.7291 
## F-statistic: 27.01 on 6 and 52 DF,  p-value: 2.329e-14

We can also visually see there is positive correlation between diabetes and percentage of chains. From the linear regression analysis, the results show that percentage of chain restaurants in neighborhoods is a significant predictor of percentage of diabetics in NYC neighborhoods

lm_neighborhood(health_chain_neighborhood$stroke_hosp)
## [[1]]

## 
## [[2]]
## 
## Call:
## lm(formula = health_outcome ~ chain_percentage + racewhite_rate + 
##     age0to44_rate + poverty + smoking + exercise, data = health_chain_neighborhood)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -134.596  -30.579    2.121   25.492  141.948 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        81.769    155.825   0.525  0.60199    
## chain_percentage  254.946    252.643   1.009  0.31759    
## racewhite_rate     -1.994      0.437  -4.562 3.12e-05 ***
## age0to44_rate      -0.161      1.734  -0.093  0.92640    
## poverty             1.912      1.359   1.407  0.16528    
## smoking             7.804      2.765   2.822  0.00675 ** 
## exercise            1.626      1.765   0.921  0.36132    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53.19 on 52 degrees of freedom
## Multiple R-squared:  0.6403, Adjusted R-squared:  0.5988 
## F-statistic: 15.43 on 6 and 52 DF,  p-value: 4.587e-10

When fitting chain_percentage, racewhite_rate, age0to44_rate, poverty,smoking, and exercise on rate of stroke however, the outcomes indicates that chain_percentage is not a significant predictor at significance level 0.05.

Inspection Grade—Austin

data2 =
  data1 %>% 
  mutate(neighborhood = str_to_upper(neighborhood))

combined1 = 
  left_join(data2, health_neighborhood, by = "neighborhood")

First, combine the data which shows the health data of each neighborhood with the dataset which shows the percentage of “grade-A” restaurants in each neighborhood. Then we acquire the final dataset, we can do the linear regression between obesity, diabetes and stroke with the percentage of “grade-A” restaurants, which we treat as a represent of the healthy restaurants in that neighborhood.

reg4 <- lm(obesity ~ grade_percent + racewhite_rate + poverty + unemployment + smoking + exercise + diabetes + stroke_hosp + age0to44_rate, data = combined1 )
summary(reg4)
## 
## Call:
## lm(formula = obesity ~ grade_percent + racewhite_rate + poverty + 
##     unemployment + smoking + exercise + diabetes + stroke_hosp + 
##     age0to44_rate, data = combined1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.1713  -2.0274  -0.3113   2.0525   8.1006 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      9.54617   12.87625   0.741  0.46200    
## grade_percent  -17.60339   20.58065  -0.855  0.39653    
## racewhite_rate   0.06968    0.04336   1.607  0.11444    
## poverty         -0.12634    0.11185  -1.130  0.26419    
## unemployment     0.54626    0.30516   1.790  0.07962 .  
## smoking          0.38692    0.20430   1.894  0.06415 .  
## exercise        -0.11207    0.12817  -0.874  0.38615    
## diabetes         1.25144    0.24981   5.010 7.49e-06 ***
## stroke_hosp      0.02974    0.01092   2.722  0.00895 ** 
## age0to44_rate   -0.06756    0.12009  -0.563  0.57631    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.616 on 49 degrees of freedom
## Multiple R-squared:  0.8277, Adjusted R-squared:  0.796 
## F-statistic: 26.15 on 9 and 49 DF,  p-value: 8.788e-16
broom::tidy(reg4)
##              term     estimate   std.error  statistic      p.value
## 1     (Intercept)   9.54616877 12.87625409  0.7413778 4.620030e-01
## 2   grade_percent -17.60338930 20.58065261 -0.8553368 3.965278e-01
## 3  racewhite_rate   0.06968106  0.04335632  1.6071718 1.144423e-01
## 4         poverty  -0.12633617  0.11185135 -1.1295007 2.641859e-01
## 5    unemployment   0.54625678  0.30515729  1.7900827 7.962051e-02
## 6         smoking   0.38691930  0.20429633  1.8939121 6.414785e-02
## 7        exercise  -0.11207145  0.12816592 -0.8744247 3.861528e-01
## 8        diabetes   1.25143941  0.24981162  5.0095324 7.487196e-06
## 9     stroke_hosp   0.02974203  0.01092482  2.7224286 8.950648e-03
## 10  age0to44_rate  -0.06755748  0.12009280 -0.5625440 5.763102e-01

p-value is smaller than 0.05, reject the null, meaning that there is a linear association between the obesity and the percentage of grade A restaurant in each neighborhood.

reg5 <- lm(diabetes ~ grade_percent + racewhite_rate + poverty + unemployment + smoking + exercise + obesity + stroke_hosp + age0to44_rate, data = combined1 )
summary(reg5)
## 
## Call:
## lm(formula = diabetes ~ grade_percent + racewhite_rate + poverty + 
##     unemployment + smoking + exercise + obesity + stroke_hosp + 
##     age0to44_rate, data = combined1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4327 -1.0391 -0.0606  0.7558  4.3134 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.310450   5.930227   1.233 0.223553    
## grade_percent  15.771539   9.375106   1.682 0.098878 .  
## racewhite_rate -0.071580   0.017983  -3.980 0.000227 ***
## poverty         0.058420   0.052023   1.123 0.266926    
## unemployment   -0.149733   0.144907  -1.033 0.306534    
## smoking        -0.029092   0.098334  -0.296 0.768594    
## exercise       -0.083631   0.058866  -1.421 0.161731    
## obesity         0.270641   0.054025   5.010 7.49e-06 ***
## stroke_hosp    -0.001235   0.005448  -0.227 0.821658    
## age0to44_rate   0.011424   0.056004   0.204 0.839216    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.682 on 49 degrees of freedom
## Multiple R-squared:  0.8445, Adjusted R-squared:  0.8159 
## F-statistic: 29.56 on 9 and 49 DF,  p-value: < 2.2e-16
broom::tidy(reg5)
##              term     estimate   std.error  statistic      p.value
## 1     (Intercept)  7.310450314 5.930226647  1.2327438 2.235533e-01
## 2   grade_percent 15.771539488 9.375106261  1.6822785 9.887779e-02
## 3  racewhite_rate -0.071580076 0.017983053 -3.9804184 2.271823e-04
## 4         poverty  0.058420009 0.052023138  1.1229620 2.669258e-01
## 5    unemployment -0.149733349 0.144907241 -1.0333048 3.065344e-01
## 6         smoking -0.029092361 0.098334403 -0.2958513 7.685938e-01
## 7        exercise -0.083631422 0.058865519 -1.4207200 1.617307e-01
## 8         obesity  0.270640770 0.054025156  5.0095324 7.487196e-06
## 9     stroke_hosp -0.001234733 0.005448344 -0.2266253 8.216579e-01
## 10  age0to44_rate  0.011423635 0.056004414  0.2039774 8.392160e-01

p-value smaller than 0.05, reject the null, meaning that there is a linear association between the diabetes and the percentage of grade A restaurant in each neighborhood.

reg6 <- lm(stroke_hosp ~ grade_percent + racewhite_rate + poverty + unemployment + smoking + exercise + diabetes + obesity + age0to44_rate, data = combined1 )
summary(reg6)
## 
## Call:
## lm(formula = stroke_hosp ~ grade_percent + racewhite_rate + poverty + 
##     unemployment + smoking + exercise + diabetes + obesity + 
##     age0to44_rate, data = combined1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.125 -25.090   2.113  22.378 129.566 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      -8.4500   157.7977  -0.054  0.95751   
## grade_percent  -267.7030   249.7739  -1.072  0.28907   
## racewhite_rate   -0.6924     0.5330  -1.299  0.20001   
## poverty          -0.8023     1.3760  -0.583  0.56251   
## unemployment      8.7431     3.6298   2.409  0.01981 * 
## smoking           3.5192     2.5298   1.391  0.17048   
## exercise          1.8547     1.5517   1.195  0.23773   
## diabetes         -0.8480     3.7418  -0.227  0.82166   
## obesity           4.4175     1.6226   2.722  0.00895 **
## age0to44_rate     1.1773     1.4586   0.807  0.42348   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.07 on 49 degrees of freedom
## Multiple R-squared:  0.7673, Adjusted R-squared:  0.7246 
## F-statistic: 17.95 on 9 and 49 DF,  p-value: 1.071e-12
broom::tidy(reg6)
##              term     estimate   std.error   statistic     p.value
## 1     (Intercept)   -8.4500282 157.7976975 -0.05354976 0.957511620
## 2   grade_percent -267.7030180 249.7739096 -1.07178135 0.289068842
## 3  racewhite_rate   -0.6924424   0.5330362 -1.29905320 0.200005306
## 4         poverty   -0.8023331   1.3760135 -0.58308520 0.562510062
## 5    unemployment    8.7431311   3.6297859  2.40871813 0.019814072
## 6         smoking    3.5192296   2.5298354  1.39109035 0.170482667
## 7        exercise    1.8546566   1.5516569  1.19527494 0.237732597
## 8        diabetes   -0.8479942   3.7418330 -0.22662534 0.821657867
## 9         obesity    4.4174723   1.6226219  2.72242857 0.008950648
## 10  age0to44_rate    1.1773397   1.4586409  0.80714841 0.423481733

Surprise! We get an insignificant variable! P-value is higher than 0.05, suggesting there is no linear association between stroke and the percentage of grade A restaurant in each neighborhood.

Conclusion—to do later

write

Discussion—to do later

write

Additional analysis/sensitivity analysis—Ruiwei

rest_healthyfood_neighborhood = 
  rest_neighborhood %>%
  filter(cuisine_description %in% c("Fruits/Vegetables",
                                    "Juice, Smoothies, Fruit Salads",
                                    "Salads",
                                    "Vegetarian",
                                    "Soups")) 

percent_healthyfood_neighborhood = function(name_neighborhood){

  rest_each_neighborhood =
    rest_neighborhood %>%
    filter(neighborhood == name_neighborhood) %>%
    distinct(camis)
  
  n_rest_neighborhood = nrow(rest_each_neighborhood)

  rest_healthyfood_distinct_neighborhood = 
    rest_healthyfood_neighborhood %>%
    filter(neighborhood == name_neighborhood) %>%
    distinct(camis, cuisine_description)
  
  n_healthyfood_neighborhood = nrow(rest_healthyfood_distinct_neighborhood)
    
  percent_healthyfood_neighborhood = n_healthyfood_neighborhood/n_rest_neighborhood
  
  tibble(
    neighborhood = name_neighborhood,
    n_healthyfood = n_healthyfood_neighborhood,
    n_rest = n_rest_neighborhood,
    percent_healthyfood = percent_healthyfood_neighborhood
  )
}

healthyfood_neighborhood = 
  map(neighborhood_list$neighborhood, percent_healthyfood_neighborhood) %>%
  bind_rows() %>%
  mutate(neighborhood = str_to_upper(neighborhood))

healthyfood_neighborhood %>%
  mutate(neighborhood = as.factor(neighborhood),
         n_rest = as.numeric(n_rest),
         neighborhood = fct_reorder(neighborhood, n_rest)) %>%
  plot_ly(., x = ~neighborhood, y = ~n_rest, type = 'bar', name = 'total restaurants') %>%
  add_trace(y = ~n_healthyfood, name = 'healthyfood restaurants') %>%
  layout(yaxis = list(title = 'number of restaurants'), barmode = 'stack')